{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DPO (Direct Preference Optimization) training and its datasets\n",
    "\n",
    "📌 DPO (Direct Preference Optimization) datasets for LLM training, typically consist of a collection of answers that are ranked by humans. This ranking is essential, as the RLHF process fine-tunes LLMs to output the preferred answer. \n",
    "\n",
    "📌 The structure of the dataset is straightforward: for each row, there is one chosen (preferred) answer, and one rejected answer. The goal of RLHF is to guide the model to output the preferred answer.\n",
    "\n",
    "📌 And Huggingface's `DPOTrainer` expects a very specific format for the dataset. \n",
    "\n",
    "📌 Since the model will be trained to directly optimize the preference of which sentence is the most relevant, given two sentences. We provide an example from the Anthropic/hh-rlhf dataset below.\n",
    "\n",
    "📌 To synthetically create DPO datasets for a set of prompts, you can create the answers with GPT-4/3.5 which will be your preferred answers, and with Llama-2-13b or similar class of models, create the rejected responses. \n",
    "\n",
    "It’s a smart way to bypass human feedback and only rely on models with different levels of size/performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# util method for readying a DPO dataset\n",
    "\n",
    "def dpo_data(dataset_id, split='train_prefs'):\n",
    "    # Function to return a Dataset object containing\n",
    "    # processed data with 'prompt', 'chosen', and 'rejected' fields\n",
    "    dataset = load_dataset(dataset_id, split=split)\n",
    "\n",
    "    # Capture the original column names for removal later.\n",
    "    original_columns = dataset.column_names\n",
    "\n",
    "    def map_function(samples):\n",
    "        return {\n",
    "            \"prompt\": samples[\"prompt\"],\n",
    "            \"chosen\": samples[\"chosen\"],\n",
    "            \"rejected\": samples[\"rejected\"]\n",
    "        }\n",
    "\n",
    "    # Apply the mapping function to the dataset to extract required fields.\n",
    "    # 'batched=True' allows processing in batches for efficiency.\n",
    "    return dataset.map(map_function, batched=True,\n",
    "                       remove_columns=original_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------------\n",
    "\n",
    "These datasets also tend to be a lot smaller than fine-tuning datasets. To illustrate this, the excellent neural-chat-7b-v3–1 (best 7B LLM on the Open LLM Leaderboard when it was released) uses 518k samples for fine-tuning (Open-Orca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). \n",
    "\n",
    "In this case, the authors generated answers with GPT-4/3.5 to create the preferred answers, and with Llama 2 13b chat to create the rejected responses. \n",
    "\n",
    "It’s a smart way to bypass human feedback and only rely on models with different levels of performance.\n",
    "\n",
    "The core concept of PPO revolves around making smaller, incremental updates to the policy, as larger updates can lead to instability or suboptimal solutions. From experience, this technique is unfortunately still unstable (loss diverges), difficult to reproduce (numerous hyperparameters, sensitive to random seeds), and computationally expensive.\n",
    "\n",
    "**This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. It means we’re penalizing the LLM for bad answers and rewarding it for good ones.**\n",
    "\n",
    "By using the LLM itself as a reward model and employing binary cross-entropy objectives, DPO efficiently aligns the model’s outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter adjustments. It results in a more stable, more efficient, and computationally less demanding process."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
